{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><h1> Information theory </h1></center>\n",
    "\n",
    "# 1. Entropy\n",
    "The entropy of a random variable $X$ with distribution $p$ is denoted by $H(X)$, which measures its uncertainty.For discrete variable, it's defined by\n",
    "$$\n",
    "H(X)=-\\sum_{k=1}^K p(X=k)log_2{p(X=k)}\n",
    "$$\n",
    "The discrete distribution with maximium entropy is the uniform distribution, which is with largest uncertainty.\n",
    "\n",
    "For continuous variable, the entropy is defined by **differential entropy**\n",
    "$$\n",
    "\tH(X)=-\\int p(x)log_2{p(x)}dx\t\t\n",
    "$$\n",
    "The continous distribution with maximium entropy is the Gaussian distribution ,which is the most nature form of continous probability distribution.\n",
    "# 2. KL divergence\n",
    "The **KL divergence** or **relative entropy** is used to measure the dissimilarity of two probability distributions $p$ and $q$.\n",
    "\\begin{align*}\n",
    "KL(p,q) &=\\sum_{k=1}^K p_k log_2{\\frac{p_k}{q_k}}\\\\\n",
    "\t\t &=-H(p)+H(p,q)\n",
    "\\end{align*}\n",
    "where $H(p,q)$ is called **cross entropy**\n",
    "\\begin{equation*}\n",
    "\tH(p,q)=-\\sum_{k=1}^K p_k log_2 q_k\n",
    "\\end{equation*}\n",
    "We can derive from the **Jensen's inequality** that $KL(p,q) \\geq 0$\n",
    "\\begin{align}\n",
    "KL(p,q) &=-\\sum_{k=1}^K p_klog_2\\frac{q_k}{p_k} \\\\\n",
    "        &>=-log_2(\\sum_{k=1}^K p_k\\frac{q_k}{p_k}) \\\\\n",
    "        &=0\n",
    "\\end{align}\n",
    "# 3. Mutual information\n",
    "The mutual information (MI) measures how much knowing one variable $X$ tells us about the other variable $Y$. It's the **KL divergence** between the joint distribution $p(X,Y)$ and the factored distribution $p(X)p(Y)$.\n",
    "$$\n",
    "MI(X;Y)=KL(p(X,Y),(X)p(Y))=\\sum_x\\sum_y p(x,y)log\\frac{p(x,y)}{p(x)p(y)}\n",
    "$$\n",
    "The $MI(X;Y) \\geq 0$ and is equal to 0 only when $p(X,Y)=p(X)p(Y)$ which means $X$ and $Y$ are indenpendent.\n",
    "We can also write $MI$ as following\n",
    "$$\n",
    "MI(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)\n",
    "$$\n",
    "We can interpret $MI$ between $X$ and $Y$ as the reduction in uncertainty about $X$ after observing $Y$. The conditional entropy is defined as \n",
    "$$\n",
    "H(Y|X)=\\sum_x H(Y|X=x)\\,p(x)\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The mutual information between each feature and the target is:\n",
      "[0.50056184 0.23107944 0.9896111  0.97301189]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.feature_selection import mutual_info_classif\n",
    "iris=load_iris()\n",
    "mi=mutual_info_classif(iris.data,iris.target,discrete_features=False)\n",
    "print(\"The mutual information between each feature and the target is:\")\n",
    "print(mi)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
